36 research outputs found

    Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter

    Full text link
    Resource demands of HPC applications vary significantly. However, it is common for HPC systems to assign resources primarily on a per-node basis to prevent interference from co-located workloads. The gap between this coarse-grained resource allocation and the varying resource demands can leave HPC resources underutilized. In this study, we analyze the resource usage and application behavior of NERSC's Perlmutter, a state-of-the-art open-science HPC system with both CPU-only and GPU-accelerated nodes. Our one-month usage analysis reveals that CPUs are commonly not fully utilized, especially for GPU-enabled jobs. Around 64% of both CPU-only and GPU-enabled jobs used 50% or less of the available host memory capacity, about 50% of GPU-enabled jobs used at most 25% of the GPU memory, and memory capacity was, in one respect or another, not fully utilized across all jobs. While our study comes early in Perlmutter's lifetime, and policies and application workloads may therefore change, it provides valuable insights into performance characterization and application behavior, and it motivates systems with more fine-grained resource allocation.
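
    As a rough illustration of the kind of per-job breakdown summarized above, the following minimal Python sketch computes what fraction of jobs stay below given host-memory and GPU-memory utilization thresholds. The job-record layout, field meanings, and sample values are hypothetical; they are not Perlmutter's actual data schema or measurements.

        # Minimal sketch of a per-job memory-utilization breakdown, assuming a
        # hypothetical job log with per-job peak usage and node capacities.
        # Field names and sample values are illustrative, not Perlmutter data.

        jobs = [
            # (peak host memory used [GB], host memory capacity [GB],
            #  peak GPU memory used [GB], GPU memory capacity [GB]; None for CPU-only jobs)
            (110, 512, 8, 40),
            (400, 512, None, None),
            (60, 256, 30, 40),
        ]

        def fraction_below(jobs, util, threshold):
            """Fraction of jobs whose utilization ratio is <= threshold."""
            ratios = [util(j) for j in jobs if util(j) is not None]
            return sum(r <= threshold for r in ratios) / len(ratios)

        host_util = lambda j: j[0] / j[1]
        gpu_util = lambda j: (j[2] / j[3]) if j[3] is not None else None

        print("jobs using <=50% of host memory:", fraction_below(jobs, host_util, 0.50))
        print("GPU jobs using <=25% of GPU memory:", fraction_below(jobs, gpu_util, 0.25))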

    PaST-NoC: A Packet-Switched Superconducting Temporal NoC

    Full text link
    Temporal computing promises to mitigate the stringent area constraints and clock distribution overheads of traditional superconducting digital computing. To design a scalable, area- and power-efficient superconducting network on chip (NoC), we propose the packet-switched superconducting temporal NoC (PaST-NoC). PaST-NoC operates its control path in the temporal domain using race logic (RL), combined with bufferless deflection flow control to minimize area. Packets encode their destination using RL and carry a collection of data pulses that the receiver can interpret as pulse trains, RL, serialized binary, or other formats. We demonstrate how to scale up PaST-NoC to arbitrary topologies using 2x2 routers and 4x4 butterflies as building blocks. As we show, if data pulses are interpreted using RL, PaST-NoC outperforms state-of-the-art superconducting binary NoCs in throughput per area by as much as 5x for long packets. Comment: 14 pages, 18 figures, 2 tables. In press in IEEE Transactions on Applied Superconductivity.
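
    To make the race-logic control path more concrete, the following toy Python sketch encodes a one-bit destination as a pulse arrival time and models a single 2x2 bufferless deflection router stage: an early pulse decodes to output 0, a late one to output 1, and on a conflict the second flit is deflected. The timing values and deflection rule are simplified assumptions for illustration, not PaST-NoC's actual circuit.

        # Toy race-logic model: a value is the arrival time of a pulse (smaller = earlier).
        # A 2x2 bufferless deflection router stage: each flit carries one temporally
        # encoded destination bit (early pulse -> output 0, late pulse -> output 1).
        # If both flits request the same output, the second requester is deflected.
        # This is a simplified illustration, not PaST-NoC's actual design.

        EARLY, LATE = 1, 3  # arrival times (arbitrary time units)

        def decode_bit(arrival_time, threshold=2):
            """Race-logic decode: a pulse arriving before the threshold means bit 0."""
            return 0 if arrival_time < threshold else 1

        def route_2x2(flit_a, flit_b):
            """Return (output for A, output for B), deflecting B on a conflict."""
            out_a = decode_bit(flit_a)
            out_b = decode_bit(flit_b)
            if out_a == out_b:          # contention: deflect B to the other port
                out_b = 1 - out_b
            return out_a, out_b

        print(route_2x2(EARLY, EARLY))  # both want output 0 -> B deflected to 1
        print(route_2x2(EARLY, LATE))   # no conflict -> (0, 1)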

    Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

    Full text link
    The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually, as required, to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) requirements and high escape bandwidth of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) across 25 CPU benchmarks and 61% across 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline. Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
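
    The reduction in memory modules can be read as a pooling argument: a non-disaggregated baseline with identical nodes must provision every node for the largest peak any node might see, while a disaggregated rack only needs enough modules to cover the rack-wide aggregate demand. The Python sketch below works through that arithmetic with made-up per-node demands and module capacities; all numbers are illustrative assumptions, not the paper's measurements.

        import math

        # Illustrative pooling argument for intra-rack memory disaggregation.
        # All numbers are made up; they only demonstrate the arithmetic.

        module_gb = 64                                             # capacity of one memory module
        node_peak_demand_gb = [40, 35, 500, 60, 45, 30, 55, 480]   # per-node peak demands

        # Non-disaggregated baseline: identical nodes, each provisioned for the
        # largest peak any node might see.
        per_node_modules = math.ceil(max(node_peak_demand_gb) / module_gb)
        baseline_modules = per_node_modules * len(node_peak_demand_gb)

        # Disaggregated: modules are pooled across the rack, so provisioning only
        # needs to cover the aggregate demand (assuming peaks do not all coincide).
        pooled_modules = math.ceil(sum(node_peak_demand_gb) / module_gb)

        print("baseline modules:", baseline_modules)
        print("pooled modules:  ", pooled_modules)
        print("reduction:        %.1fx" % (baseline_modules / pooled_modules))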

    Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization

    Full text link
    Continuing to scale quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real-time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map the high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a methodology to quantitatively assess how effectively an ISA encodes quantum circuits for intermediate-scale quantum devices with O(10^2) qubits. The characterization model that we define reflects performance, the ability to meet timing constraints, scalability to future quantum chips, and other important considerations, making it a useful guide for future designs. Using our methodology, we propose scalar (QUASAR) and vector (qV) quantum ISAs as extensions and compare them with other ISAs on metrics such as circuit encoding efficiency, the ability to meet the real-time gate cycle requirements of quantum chips, and the ability to scale to more qubits. Comment: 10 pages, 8 figures.
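
    One of the metrics above, circuit encoding efficiency, can be illustrated by counting how many control-processor instructions a circuit requires under a scalar ISA (one instruction per gate) versus a vector ISA (one instruction per group of identical gates in a layer). The Python sketch below uses a hypothetical circuit format and counting rule for illustration; it is not the actual QUASAR or qV encoding.

        # Hypothetical encoding-efficiency comparison: instructions needed to encode
        # a circuit layer by layer under a scalar vs. a vector control ISA.
        # The circuit format and counting rules are illustrative assumptions.

        # Each layer is a list of (gate, qubits) operations that can run in parallel.
        circuit = [
            [("H", (q,)) for q in range(16)],                  # layer 0: H on 16 qubits
            [("CZ", (q, q + 1)) for q in range(0, 16, 2)],     # layer 1: 8 CZ gates
            [("RZ", (q,)) for q in range(16)],                 # layer 2: RZ on 16 qubits
        ]

        def scalar_count(circuit):
            """Scalar ISA: one instruction per gate."""
            return sum(len(layer) for layer in circuit)

        def vector_count(circuit):
            """Vector ISA: one instruction per group of identical gates in a layer."""
            return sum(len({gate for gate, _ in layer}) for layer in circuit)

        print("scalar instructions:", scalar_count(circuit))   # 16 + 8 + 16 = 40
        print("vector instructions:", vector_count(circuit))   # 1 + 1 + 1 = 3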

    Pre-Configured Routes

    No full text
    In multi-core ASICs, processors and other compute engines need to communicate with memory blocks and other cores with latency as close as possible to the ideal of a direct buffered wire. However, current state-of-the-art networks-on-chip (NoCs) suffer, at best, a latency of one clock cycle per hop. We investigate the design of a NoC that offers close to the ideal latency on some preferred, run-time configurable paths. Processors and other compute engines may reconfigure the network to guarantee low latency over different sets of paths as needed. Flits on non-preferred paths are given lower priority than flits on preferred paths to enable the latter to provide low latency. To achieve our goal, we extend the “mad-postman” technique [1]: every incoming flit is eagerly (i.e., speculatively) forwarded to the input’s preferred output, if any. This is accomplished with the mere delay of a single pre-enabled tri-state driver. We later check whether that decision was correct, and if not, we forward the flit to the proper output. Incorrectly forwarded flits are classified as dead and are eliminated at later hops.
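
    The following Python sketch models the eager-forwarding behavior described above at a single router input, under a hypothetical flit and router representation: the flit is speculatively forwarded to the input's preferred output, the routing decision is then checked, and on a mis-speculation the eager copy is marked dead while a correct copy is sent to the proper output. In hardware the eager copy is made by a pre-enabled tri-state driver, not by function calls.

        # Behavioral sketch of eager ("mad-postman") forwarding at one router input.
        # Flit and router structures are hypothetical; this only illustrates the
        # speculate-then-check control flow described in the abstract.

        from dataclasses import dataclass

        @dataclass
        class Flit:
            dest: int
            dead: bool = False          # dead flits are dropped at later hops

        def route_lookup(dest, num_ports=4):
            """Placeholder routing function: proper output port for a destination."""
            return dest % num_ports

        def on_incoming_flit(flit, preferred_output, output_queues):
            """Eagerly forward on the preferred path, then check and patch up."""
            # 1. Speculatively forward the flit to the input's preferred output.
            if preferred_output is not None:
                output_queues[preferred_output].append(flit)

            # 2. Check the speculation against the real routing decision.
            proper_output = route_lookup(flit.dest)
            if preferred_output == proper_output:
                return                  # speculation was correct

            # 3. Mis-speculated: the eager copy is marked dead (eliminated at a
            #    later hop) and a correct copy goes out on the proper output.
            flit.dead = True
            output_queues[proper_output].append(Flit(dest=flit.dest))

        # Example: a flit for destination 6 arrives at an input whose preferred output is 1.
        queues = {p: [] for p in range(4)}
        on_incoming_flit(Flit(dest=6), preferred_output=1, output_queues=queues)
        for port, q in queues.items():
            print(port, q)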

    Collective Memory Transfers for Multi-Core Chips

    No full text